Let's say we have $\hat{p} = X/N = 100/1000 = 0.1$, where $X$ = # of users who clicked and $N$ = # of users.
A rule of thumb for normality is to check $N\hat{p} > 5$ (might as well also check $N(1-\hat{p}) > 5$); otherwise use the t-distribution instead of the z-distribution.
The margin of error $m = z_{\alpha/2} \cdot SE = z_{\alpha/2} \cdot \sqrt{\hat{p}(1-\hat{p})/N}$. Notice that here $SE = \sqrt{p(1-p)/N}$ instead of $\sqrt{Np(1-p)}$ as for the binomial distribution, since we use the fraction or proportion of successes instead of the total number of successes.
For $\alpha = 5\%$, we have $m = z_{0.025} \cdot \sqrt{0.1 \times 0.9 / 1000} \approx 0.019$, and the final 95% CI is $[0.081, 0.119]$.
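The interval above can be reproduced in a few lines of Python; this is a minimal sketch using the same numbers as the example:

```python
import math

# 100 clicks out of 1000 users, as in the example above.
X, N = 100, 1000
p_hat = X / N

# Rule-of-thumb check for the normal approximation.
assert N * p_hat > 5 and N * (1 - p_hat) > 5

z = 1.96  # z_{0.025} for a 95% confidence level
se = math.sqrt(p_hat * (1 - p_hat) / N)
m = z * se
ci = (p_hat - m, p_hat + m)
print(round(m, 3), [round(x, 3) for x in ci])  # 0.019 [0.081, 0.119]
```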
1.17 Null and Alternative Hypothesis, Two-tailed vs. One-tailed tests
The null hypothesis and alternative hypothesis proposed here correspond to a two-tailed test, which allows you to distinguish between three cases:
A statistically significant positive result
A statistically significant negative result
No statistically significant difference.
Sometimes when people run A/B tests, they will use a one-tailed test, which only allows you to distinguish between two cases:
A statistically significant positive result
No statistically significant result
Which one you should use depends on what action you will take based on the results.
If you're going to launch the experiment for a statistically significant positive change, and otherwise not, then you don't need to distinguish between a negative result and no result, so a one-tailed test is good enough. If you want to learn the direction of the difference, then a two-tailed test is necessary.
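The practical difference can be seen numerically; in this sketch the test statistic $z = 1.7$ is an illustrative value (not from the course), chosen to show that the same statistic can be significant one-tailed but not two-tailed:

```python
import math

def normal_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

z = 1.7  # an illustrative positive test statistic
p_one_tailed = 1 - normal_cdf(z)             # H1: effect > 0
p_two_tailed = 2 * (1 - normal_cdf(abs(z)))  # H1: effect != 0
# p_one_tailed ≈ 0.045 (significant at α = 0.05),
# while p_two_tailed ≈ 0.089 (not significant).
```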
1.19 Pooled Standard Error
We have $X_{cont}, X_{exp}, N_{cont}, N_{exp}$, and
Pooled sample mean $\hat{p}_{pool} = \frac{X_{cont} + X_{exp}}{N_{cont} + N_{exp}}$
Pooled sample standard error $SE_{pool} = \sqrt{\hat{p}_{pool}(1-\hat{p}_{pool})\left(\frac{1}{N_{cont}} + \frac{1}{N_{exp}}\right)}$
Test statistic $\hat{d} = \hat{p}_{exp} - \hat{p}_{cont}$
Null hypothesis $H_0: d = 0$, under which $\hat{d} \sim N(0, SE_{pool})$
For a 95% confidence level ($z_{1-0.05/2} = 1.96$), if $\hat{d} > 1.96 \times SE_{pool}$ or $\hat{d} < -1.96 \times SE_{pool}$, reject the null.
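The decision rule above can be sketched directly in Python; the counts passed in at the bottom are illustrative numbers, not from the course:

```python
import math

def pooled_z_test(x_cont, n_cont, x_exp, n_exp, z=1.96):
    """Two-proportion z-test with a pooled standard error.
    Returns (d_hat, se_pool, reject_null)."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    d_hat = x_exp / n_exp - x_cont / n_cont
    return d_hat, se_pool, abs(d_hat) > z * se_pool

# Illustrative counts: reject the null if |d_hat| > 1.96 * SE_pool.
d_hat, se_pool, reject = pooled_z_test(x_cont=974, n_cont=10072,
                                       x_exp=1242, n_exp=9886)
```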
1.21 - 24. Sample Size and Power
Two types of error
α=P(reject null | null True)
β=P(not reject null | null False)
So if the sample is small, α stays at its chosen level but β is high, i.e., it is harder to detect the alternative when a difference exists. If the sample is large, α is the same, but β is much lower, as shown below.
(Figures: sampling distributions under the null and alternative for sample size = 1000 vs. sample size = 5000)
$1-\beta$ is called sensitivity and is often chosen to be > 80%.
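A quick simulation illustrates how power ($1-\beta$) grows with sample size; the 10% vs. 12% conversion rates below are assumptions for illustration, not numbers from the course:

```python
import math
import random

def power_sim(n, p_cont=0.10, p_exp=0.12, trials=400, seed=1):
    """Estimate power (1 - beta): the fraction of simulated A/B tests
    where the pooled z-test rejects the null at the 95% level."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        x_cont = sum(rng.random() < p_cont for _ in range(n))
        x_exp = sum(rng.random() < p_exp for _ in range(n))
        p_pool = (x_cont + x_exp) / (2 * n)
        se_pool = math.sqrt(p_pool * (1 - p_pool) * (2 / n))
        d_hat = x_exp / n - x_cont / n
        rejections += abs(d_hat) > 1.96 * se_pool
    return rejections / trials

# Power rises sharply from n = 1000 to n = 5000 per group.
low, high = power_sim(1000), power_sim(5000)
```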
Note on power
Statistical textbooks often define power as the sensitivity. However, conversationally power often means the probability that your test draws the correct conclusions, which depends on both α and β.
The required sample size to achieve a certain statistical power can be calculated using an online calculator, in which you specify α, β, the baseline conversion rate (null), and the minimum detectable effect (alternative).
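As a rough cross-check of such calculators, the normal-approximation formula $n \approx 2(z_{1-\alpha/2} + z_{1-\beta})^2 \, p(1-p) / d_{min}^2$ per group can be sketched in Python. The 10% baseline and 2% effect below are illustrative assumptions, and online calculators use slightly more exact formulas, so results will differ somewhat:

```python
import math

def required_n(p_base, d_min, alpha=0.05, beta=0.2):
    """Approximate per-group sample size for a two-proportion test
    via the normal approximation (a sketch, not an exact formula)."""
    # Hard-coded z-values to keep the sketch dependency-free.
    z_table = {0.025: 1.96, 0.2: 0.84}
    z_a = z_table[alpha / 2]   # z_{1-alpha/2}
    z_b = z_table[beta]        # z_{1-beta}
    p = p_base + d_min / 2     # rough average of null and alternative rates
    return math.ceil(2 * (z_a + z_b) ** 2 * p * (1 - p) / d_min ** 2)

# Baseline conversion 10%, minimum detectable effect 2 percentage points.
n = required_n(p_base=0.10, d_min=0.02)
```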
Final notes on how the type I & II error levels (α, β) and the minimum detectable difference $d_{min}$ together determine the required sample size are as follows:
Examples of factors that affect the required sample size are as follows:
1.25 Pooled Example
A pooled example is shown below; notice how $d_{min}$ works (we need the lower bound of the $1-\alpha$ level CI to exceed $d_{min} = 0.02$).
1.26 Confidence Interval Case Breakdown
Shown below is how we should make the decision under varying CI and $d_{min}$ cases.
Lesson 2: Policy and Ethics for Experiments
2.1 - 2.7. Four Principles
The IRB's four main principles to consider when conducting experiments are:
Risk: what risk is the participant undertaking? The main threshold is whether the risk exceeds that of “minimal risk”, defined as the probability and magnitude of harm that a participant would encounter in normal daily life.
Benefit: what benefits might result from the study?
Choice/Alternatives: what other choices do participants have?
Privacy/Data Sensitivity: what data is being collected, and what is the expectation of privacy and confidentiality?
How sensitive is the data?
What is the re-identification risk of individuals from the data?
2.8 Assessing Data Sensitivity
An example of data sensitivity assessment is shown below
2.10 Summary of Principles
It's a grey area whether internet studies should be subject to IRB review or not and whether informed consent is required.
Most studies face the bigger question about data collection with regards to identifiability, privacy, and confidentiality / security.
Are participants facing more than minimal risk?
Do participants understand what data is being gathered?
Is that data identifiable?
How is the data handled?
Lesson 3: Choosing and Characterizing Metrics
3.2 - 3.3 Metric Definition Overview
Invariant Checking: metrics shouldn't change across experiment and control
Evaluation: what do we want to use the metrics for?
At the evaluation stage, it's better to settle on one single objective that multiple departments within the company would most likely agree on.
If multiple metrics are available or equally important, we can create a composite metric, e.g., an objective function or OEC (Overall Evaluation Criterion, a term coined at Microsoft).
A composite metric is less preferred: it is better to come up with a less optimal metric that works for a whole suite of A/B tests than a perfect metric that works for only a single test.
3.5 Refining the Customer Funnel
An example of defining metrics for Udacity
Refining the customer funnel
High-level metrics
3.6 - 3.7 Quizzes on Choosing Metrics
How to choose metrics for different tests
Difficult metrics
Don't have access to data, e.g.,
Amazon wants to measure average happiness of shoppers
Google wants to measure probability of user finding information via search
Takes too long to measure, e.g.,
Udacity measures the rate of customers who completed the 1st course returning for 2nd one.
3.8 Other techniques for defining metrics
External data
User experience research, surveys, focus groups
Retrospective analysis helps detect correlations for us to develop theories.
3.10 - 11 Techniques to Gather Additional Data and Examples
Techniques for gathering additional data
Udacity example
Examples where data is hard to get
3.13 Metric Definition: Click Through Example
Metric definition
3.16 - 3.17 Summary Metrics
Categories of summary metrics
Sums and counts.
e.g., # users who visited page
Means, medians, and percentiles
e.g., mean age of users who completed a course or
median latency of page load
Probabilities and rates
Probability has 0 or 1 outcome in each case
Rate has 0 or more
Ratios
e.g., $\frac{P(\text{revenue-generating click})}{P(\text{any click})}$
3.18 - 3.19 Sensitivity and Robustness
We want summary metrics to be sensitive to changes we care about and robust to changes we don't.
Example: choose summary metric for latency of a video
Use retrospective analysis to check robustness. For example, if we plot distribution for similar videos and find the 95th and 99th percentiles of load time has noticeable variations between videos, those two metrics may not be robust enough.
We can also look at experimental data. For example, if we plot distribution of load time for videos with increasing resolution, and find that the median and 80th percentile is not affected by resolution, only the 85/90/95-th percentiles are increasing. This means that median and 80th percentile may not be sensitive enough.
3.20 Absolute Versus Relative Differences
Usually start with absolute differences when we don't know the metric well.
Using relative difference means we might be able to stick with the same significance boundary and not need to worry about seasonality factors (e.g., think about CTR for shopping websites)
Some summary metrics may be harder to analyze. E.g., median could be non-normal if data is non-normal (e.g., latency with bimodal distribution shown below)
Example: calculate the 95% CI for a mean with N = [87029, 113407, 84843, 104994, 99327, 92052, 60684]
Estimate variance and calculate CI using pooled results
Directly estimate confidence interval from empirical distribution
We can also use the bootstrap to generate multiple samples/metrics to estimate the variability.
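A minimal bootstrap sketch on the seven daily counts above, resampling with replacement and taking the empirical 2.5th and 97.5th percentiles of the resampled means:

```python
import random
import statistics

# The seven daily counts from the example above.
data = [87029, 113407, 84843, 104994, 99327, 92052, 60684]

random.seed(0)
boot_means = []
for _ in range(10000):
    resample = random.choices(data, k=len(data))  # sample with replacement
    boot_means.append(statistics.mean(resample))
boot_means.sort()

# Empirical 95% CI from the bootstrap distribution of the mean.
ci = (boot_means[249], boot_means[9749])
```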
Lesson 4: Designing an Experiment
4.2 - 4.3 Unit of Diversion Overview
Unit of diversion is how we define what an individual subject is in the experiment.
Commonly used:
User id
Stable, unchanging
Personally identifiable
Anonymous id (cookie)
Changes when you switch browser or device
Users can clear cookies
Event
No consistent experience
use only for non-user-visible changes
Less common:
Device id
only available for mobile
tied to specific device
unchangeable by user
IP address
changes when location changes
Example
4.4 - 4.5 Consistency of Diversion
The first principle of choosing a unit of diversion is to make sure users have a consistent experience.
If the customer wouldn't be likely to notice the change, we might want to start with an event-based experiment. If a learning effect is detected later, we can switch to a cookie-based experiment.
Example
4.6 - 4.7 Ethical Considerations
An example is as follows.
Notice that only the second case requires additional ethical review/consent from the user, because it might compromise the anonymity of cookie-based data.
4.8 - 4.9 Unit of Analysis vs. Diversion
Unit of analysis is basically whatever the denominator of your metric is.
In an interleaved ranking experiment, suppose you have two ranking algorithms, X and Y. Algorithm X would show results X1,X2,…XN in that order, and algorithm Y would show Y1,Y2,…YN. An interleaved experiment would show some interleaving of those results, for example, X1,Y1,X2,Y2,… with duplicate results removed. One way to measure this would be by comparing the click-through-rate or -probability of the results from the two algorithms. For more detail, see Large-Scale Validation and Analysis of Interleaved Search Evaluation.